Check that the dplyr package is up to date by typing packageVersion("dplyr"). If the installed version is less than 1.0, update it by typing update.packages("dplyr"). You may need to restart R for the update to take effect.
ifelse(packageVersion("dplyr") >= 1,
"The installed version of dplyr package is greater than or equal to 1.0.0", update.packages("dplyr")
)
## [1] "The installed version of dplyr package is greater than or equal to 1.0.0"
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(
tidyverse, # the tidyverse framework
here, # computational reproducibility
gapminder, # toy data
nycflights13, # for exercise
ggthemes, # additional themes
ggrepel, # annotating ggplot with text
patchwork, # arranging ggplots
broom # tidying model outputs
)
Signs of messy datasets
Let’s take a look at the cases of untidy data.
Messy Data Case 1 (Source: R for Data Science)
Make It Longer
Challenge: Why is this data not tidy?
table4a
Concept map for pivoting. By Florian Schmoll, Monica Alonso.
pivot_longer() increases the number of rows (longer) and decreases the number of columns. The inverse function is pivot_wider(). These functions improve on the usability of gather() and spread().
What pivot_longer() does (Source: https://www.storybench.org)
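Because pivot_longer() and pivot_wider() are inverses, a round trip recovers the original shape. Here is a minimal sketch on a hypothetical table4a-like toy dataset (the names toy, long, and wide are illustrative, not from the original):

```r
library(tidyr)

# Hypothetical toy data shaped like table4a: one column per year
toy <- tibble::tibble(
  country = c("A", "B"),
  `1999`  = c(1, 3),
  `2000`  = c(2, 4)
)

# Longer: the two year columns collapse into a year/cases pair
long <- pivot_longer(toy,
  cols = c("1999", "2000"),
  names_to = "year", values_to = "cases"
)

# Wider: the inverse operation recovers the original shape
wide <- pivot_wider(long, names_from = year, values_from = cases)
```

Comparing wide against toy is a quick sanity check that no information was lost in the reshaping.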
Concept map for pipe operator. By Jeroen Janssens, Monica Alonso.
%>% originally comes from the magrittr package. The idea behind the pipe operator is similar to the function chaining we learned in high school: f: B -> C and g: A -> B can be composed as \(f(g(x))\). The pipe operator chains operations the same way. When you read the pipe operator, read it as “and then” (Wickham’s recommendation). The keyboard shortcut is Ctrl + Shift + M. The key idea here is to avoid creating temporary variables and to focus on verbs (functions). We’ll learn more about this functional programming paradigm later on.
table4a
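To see the "and then" reading concretely, here is a minimal sketch comparing a nested call with its piped equivalent (variable names nested and piped are illustrative):

```r
library(magrittr) # provides %>%; also re-exported by dplyr and the tidyverse

# Nested call: read inside-out (sum first, then sqrt)
nested <- sqrt(sum(c(1, 4, 9)))

# Piped call: read left-to-right as "take the vector, and then sum, and then sqrt"
piped <- c(1, 4, 9) %>% sum() %>% sqrt()
```

Both expressions compute the same value; the piped version simply states the steps in the order they happen.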
# Old way, less intuitive
table4a %>%
gather(
key = "year", # Current column names
value = "cases", # The values matched to cases
c("1999", "2000")
) # Selected columns
# New way, more intuitive
table4a %>%
pivot_longer(
cols = c("1999", "2000"), # Selected columns
names_to = "year", # Shorter columns (the columns going to be in one column called year)
values_to = "cases"
) # Longer rows (the values are going to be in a separate column called cases)
There’s another problem, did you catch it?
The data type of the year variable should be numeric, not character. By default, pivot_longer() stores the former column names as character values.
You can fix this problem by using names_transform argument.
table4a %>%
pivot_longer(
cols = c("1999", "2000"), # Put two columns together
names_to = "year", # Shorter columns (the columns going to be in one column called year)
values_to = "cases", # Longer rows (the values are going to be in a separate column called cases)
names_transform = list(year = readr::parse_number)
) # Transform the variable
Additional tips
parse_number() also keeps only numeric information in a variable.
parse_number("reply1994")
## [1] 1994
A flat file (e.g., CSV) is a rectangular shaped combination of strings. Parsing determines the type of each column and turns into a vector of a more specific type. Tidyverse has parse_ functions (from readr package) that are flexible and fast (e.g., parse_integer(), parse_double(), parse_logical(), parse_datetime(), parse_date(), parse_time(), parse_factor(), etc).
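A few of these parsers in action (the specific input strings are illustrative examples, not from the original):

```r
library(readr)

parse_integer(c("1", "2"))         # integer vector
parse_double("3.14")               # double
parse_logical(c("TRUE", "false"))  # logical (case-insensitive)
num  <- parse_number("$1,234.5")   # keeps only the numeric information
date <- parse_date("2020-01-15")   # Date object
```

parse_number() is particularly handy for currency and percentage columns, since it drops the surrounding symbols and grouping marks.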
Challenge
This exercise comes from the pivot function vignette. Too long or too wide?
billboard
# Old way
billboard %>%
gather(
key = "week",
value = "rank",
starts_with("wk")
) %>% # Use regular expressions
drop_na() # Drop NAs
pivot_longer() is more versatile than gather().
# New way
billboard %>%
pivot_longer(
cols = starts_with("wk"), # Use regular expressions
names_to = "week",
values_to = "rank",
values_drop_na = TRUE # Drop NAs
)
Make It Wider
Why is this data not tidy?
table2
Each observation is spread across two rows.
How can you fix it? Use pivot_wider().
Two differences between pivot_longer() and pivot_wider()
In pivot_longer(), the arguments are named names_to and values_to (to).
In pivot_wider(), this pattern is opposite. The arguments are named names_from and values_from (from).
The number of required arguments for pivot_longer() is 3 (col, names_to, values_to).
The number of required arguments for pivot_wider() is 2 (names_from, values_from).
What pivot_wider() does (Source: https://www.storybench.org)
# Old way
table2 %>%
spread(
key = type,
value = count
)
# New way
table2 %>%
pivot_wider(
names_from = type, # first
values_from = count # second
)
Sometimes, a consultee comes to me and asks: “I didn’t have missing values in my original dataframe, but R says I have missing values after I’ve done some data transformations. What happened?”
Here’s an answer.
R defines missing values in two ways.
Implicit missing values: simply not present in the data.
Explicit missing values: flagged with NA
Challenge
The example comes from R for Data Science.
stocks <- tibble(
year = c(2019, 2019, 2019, 2020, 2020, 2020),
qtr = c(1, 2, 3, 2, 3, 4),
return = c(1, 2, 3, NA, 2, 3)
)
stocks
Where is the explicit missing value?
Does stocks have implicit missing values?
# implicit missing values become explicit
stocks %>%
pivot_wider(
names_from = year,
values_from = return
)
Challenge
This exercise comes from the pivot function vignette.
Could you make station a series of dummy variables using pivot_wider()?
fish_encounters
Which pivot should you use?
Are there explicit missing values?
How could you turn these NAs into 0s? Check values_fill argument in the pivot_wider() function.
Messy Data Case 2 (Source: R for Data Science)
# Toy example
df <- data.frame(x = c(NA, "Dad.apple", "Mom.orange", "Daughter.banana"))
df
# Separate
df %>%
separate(x, into = c("Name", "Preferred_fruit"))
# Don't need the first variable
df %>%
separate(x, into = c(NA, "Preferred_fruit"))
Practice
table3
sep argument. You can specify how to separate joined values.
table3 %>%
separate(rate,
into = c("cases", "population"),
sep = "/"
)
convert argument. You can specify whether to automatically convert the new values or not.
table3 %>%
separate(rate,
into = c("cases", "population"),
sep = "/",
convert = TRUE
) # cases and population become integers
pivot_longer() <-> pivot_wider()
separate() <-> unite()
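The inverse relationship between separate() and unite() can be verified with a tiny round trip (the names df, apart, and together are illustrative):

```r
library(tidyr)

df <- tibble::tibble(x = c("Dad.apple", "Mom.orange"))

# separate() splits one column into several ...
apart <- separate(df, x, into = c("name", "fruit"), sep = "\\.")

# ... and unite() glues them back together
together <- unite(apart, "x", c("name", "fruit"), sep = ".")
```

Note the escaped separator "\\." in separate(): sep is a regular expression, and an unescaped dot would match any character.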
# Create a toy example
df <- data.frame(
name = c("Jae", "Sun", "Jane", NA),
birthmonth = c("April", "April", "June", NA)
)
# Include missing values
df %>% unite(
"contact",
c("name", "birthmonth")
)
# Do not include missing values
df %>% unite("contact",
c("name", "birthmonth"),
na.rm = TRUE
)
This is a relatively less-known function of the tidyr package. I found this function super useful to complete time-series data. For instance, how can you replace NA in the following example (this use case is drawn from the tidyr package vignette.)?
# Example
stock <- tibble::tribble(
~ quarter, ~ year, ~stock_price,
"Q1", 2000, 10000,
"Q2", NA, 10001, # Replace NA with 2000
"Q3", NA, 10002, # Replace NA with 2000
"Q4", NA, 10003, # Replace NA with 2000
"Q1", 2001, 10004,
"Q2", NA, 10005, # Replace NA with 2001
"Q3", NA, 10006, # Replace NA with 2001
"Q4", NA, 10007, # Replace NA with 2001
)
fill(stock, year)
Let’s take a slightly more complex example.
# Example
yelp_rate <- tibble::tribble(
~ neighborhood, ~restaurant_type, ~popularity_rate,
"N1", "Chinese", 5,
"N2", NA, 4,
"N3", NA, 3,
"N4", NA, 2,
"N1", "Indian", 1,
"N2", NA, 2,
"N3", NA, 3,
"N4", NA, 4,
"N1", "Mexican", 5
)
fill(yelp_rate, restaurant_type) # default is .direction = "down"
fill(yelp_rate, restaurant_type, .direction = "up")
Concept map for dplyr. By Monica Alonso, Greg Wilson.
dplyr is better than the base R approaches to data processing (e.g., the same verbs even extend to databases via dbplyr).
Arrange
Order rows
dplyr::arrange(mtcars, mpg) # Low to High (default)
dplyr::arrange(mtcars, desc(mpg)) # High to Low
Rename
Rename columns
df <- tibble(y = c(2011, 2012, 2013))
df %>%
rename(
Year = # NEW name
y
) # OLD name
Choose row by logical condition
Single condition
starwars %>%
filter(gender == "feminine") %>%
arrange(desc(height))
The following filtering example was inspired by Suzan Baert’s dplyr blog post.
# First example
starwars %>%
filter(height < 180, height > 160) %>%
nrow()
## [1] 24
# Same as above
starwars %>%
filter(height < 180 & height > 160) %>%
nrow()
## [1] 24
# Not same as above
starwars %>%
filter(height < 180 | height > 160) %>%
nrow()
## [1] 81
Challenge
(1) Use filter(between()) to find characters whose heights are between 160 and 180 and (2) count the number of these observations.
df <- tibble(
heights = c(160:180),
char = rep("none", length(c(160:180)))
)
df %>%
filter(between(heights, 161, 179))
# Filter names include ars; `grepl` is a base R function
starwars %>%
filter(grepl("ars", tolower(name)))
# Or, if you prefer dplyr way
starwars %>%
filter(str_detect(tolower(name), "ars"))
# Filter brown and black hair_color
starwars %>%
filter(hair_color %in% c("black", "brown"))
Challenge
Use str_detect() to find characters whose names include “Han”.
starwars %>%
arrange(desc(height)) %>%
slice(1:6)
# For reproducibility
set.seed(1234)
# Old way
starwars %>%
sample_frac(0.10,
replace = FALSE
) # Without replacement
# New way
starwars %>%
slice_sample(
prop = 0.10,
replace = FALSE
)
# Old way
starwars %>%
sample_n(20,
replace = FALSE
) # Without replacement
# New way
starwars %>%
slice_sample(
n = 20,
replace = FALSE
) # Without replacement
# Old way
starwars %>%
top_n(10, height)
# New way
starwars %>%
slice_max(height, n = 10) # Variable first, Argument second
names(msleep)
## [1] "name" "genus" "vore" "order" "conservation"
## [6] "sleep_total" "sleep_rem" "sleep_cycle" "awake" "brainwt"
## [11] "bodywt"
# Only numeric
msleep %>%
dplyr::select(where(is.numeric))
Challenge
Use select(where()) to find only non-numeric columns
msleep %>%
dplyr::select(contains("sleep"))
Select the columns that include either “sleep” or “wt” in their names
Basic R way
grepl is one of the R base pattern matching functions.
msleep[grepl("sleep|wt", names(msleep))]
Challenge
Use select(matches()) to find columns whose names include either “sleep” or “wt”.
msleep %>%
dplyr::select(starts_with("b"))
msleep %>%
dplyr::select(ends_with("wt"))
The key idea is that you can use Boolean operators (!, &, |) to combine different string pattern matching statements.
msleep %>%
dplyr::select(starts_with("b") & ends_with("wt"))
# By specifying a column
msleep %>%
dplyr::select(order, everything())
msleep %>%
dplyr::select(any_of(c("name", "order"))) %>%
colnames()
## [1] "name" "order"
msleep$week8 <- NA
msleep$week12 <- NA
msleep$week_extra <- 0
msleep %>%
dplyr::select(num_range("week", c(1:12)))
Additional tips
msleep data has nicely cleaned column names. But real-world data are usually messier. The janitor package is useful to fix this kind of problem.
messy_df <- tibble::tribble(~"ColNum1", ~"COLNUM2", ~ "COL & NUM3",
1, 2, 3)
messy_df
pacman::p_load(janitor)
janitor::clean_names(messy_df)
janitor::tabyl() is helpful for doing crosstabulation and a nice alternative to the table() function.
# Frequency table; The default output class is table
table(gapminder$country)
##
## Afghanistan Albania Algeria
## 12 12 12
## Angola Argentina Australia
## 12 12 12
## Austria Bahrain Bangladesh
## 12 12 12
## Belgium Benin Bolivia
## 12 12 12
## Bosnia and Herzegovina Botswana Brazil
## 12 12 12
## Bulgaria Burkina Faso Burundi
## 12 12 12
## Cambodia Cameroon Canada
## 12 12 12
## Central African Republic Chad Chile
## 12 12 12
## China Colombia Comoros
## 12 12 12
## Congo, Dem. Rep. Congo, Rep. Costa Rica
## 12 12 12
## Cote d'Ivoire Croatia Cuba
## 12 12 12
## Czech Republic Denmark Djibouti
## 12 12 12
## Dominican Republic Ecuador Egypt
## 12 12 12
## El Salvador Equatorial Guinea Eritrea
## 12 12 12
## Ethiopia Finland France
## 12 12 12
## Gabon Gambia Germany
## 12 12 12
## Ghana Greece Guatemala
## 12 12 12
## Guinea Guinea-Bissau Haiti
## 12 12 12
## Honduras Hong Kong, China Hungary
## 12 12 12
## Iceland India Indonesia
## 12 12 12
## Iran Iraq Ireland
## 12 12 12
## Israel Italy Jamaica
## 12 12 12
## Japan Jordan Kenya
## 12 12 12
## Korea, Dem. Rep. Korea, Rep. Kuwait
## 12 12 12
## Lebanon Lesotho Liberia
## 12 12 12
## Libya Madagascar Malawi
## 12 12 12
## Malaysia Mali Mauritania
## 12 12 12
## Mauritius Mexico Mongolia
## 12 12 12
## Montenegro Morocco Mozambique
## 12 12 12
## Myanmar Namibia Nepal
## 12 12 12
## Netherlands New Zealand Nicaragua
## 12 12 12
## Niger Nigeria Norway
## 12 12 12
## Oman Pakistan Panama
## 12 12 12
## Paraguay Peru Philippines
## 12 12 12
## Poland Portugal Puerto Rico
## 12 12 12
## Reunion Romania Rwanda
## 12 12 12
## Sao Tome and Principe Saudi Arabia Senegal
## 12 12 12
## Serbia Sierra Leone Singapore
## 12 12 12
## Slovak Republic Slovenia Somalia
## 12 12 12
## South Africa Spain Sri Lanka
## 12 12 12
## Sudan Swaziland Sweden
## 12 12 12
## Switzerland Syria Taiwan
## 12 12 12
## Tanzania Thailand Togo
## 12 12 12
## Trinidad and Tobago Tunisia Turkey
## 12 12 12
## Uganda United Kingdom United States
## 12 12 12
## Uruguay Venezuela Vietnam
## 12 12 12
## West Bank and Gaza Yemen, Rep. Zambia
## 12 12 12
## Zimbabwe
## 12
# Frequency table (unique value, n, percentage)
janitor::tabyl(gapminder$country)
# If you want to add percentage ...
gapminder %>%
tabyl(country) %>%
adorn_pct_formatting(digits = 0, affix_sign = TRUE)
You can think of case_when() (multiple conditions) as an extended version of ifelse() (binary conditions).
mtcars <- mtcars %>%
mutate(cyl_dummy = case_when(cyl > median(cyl) ~ "High", # if condition
cyl < median(cyl) ~ "Low", # else if condition
TRUE ~ 'Median')) # else condition
mtcars %>% pull(cyl_dummy)
## [1] "Median" "Median" "Low" "Median" "High" "Median" "High" "Low"
## [9] "Low" "Median" "Median" "High" "High" "High" "High" "High"
## [17] "High" "Low" "Low" "Low" "Low" "High" "High" "High"
## [25] "High" "Low" "Low" "Low" "High" "Median" "High" "Low"
mtcars %>%
mutate(cyl_dummy = recode(cyl_dummy, # Target column
"High" = "2", # Old - New
"Low" = "0",
"Median" = "1")) %>%
pull(cyl_dummy)
## [1] "1" "1" "0" "1" "2" "1" "2" "0" "0" "1" "1" "2" "2" "2" "2" "2" "2" "0" "0"
## [20] "0" "0" "2" "2" "2" "2" "0" "0" "0" "2" "1" "2" "0"
gapminder %>%
count(continent)
# Just add a new argument `sort = TRUE`
gapminder %>%
count(continent, sort = TRUE)
# Same as above; How nice!
gapminder %>%
count(continent) %>%
arrange(desc(n))
Challenge
Count the number of observations per continent and year and arrange them in descending order.
Let’s take a deeper look at how things work under the hood.
tally() works similarly to nrow(): it calculates the total number of cases in a dataframe
count() = group_by() + tally()
gapminder %>%
tally()
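The equivalence count() = group_by() + tally() is easy to confirm on a toy tibble (the names df, via_count, and via_tally are illustrative):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"))

# One verb ...
via_count <- df %>% count(g)

# ... versus the two verbs it is built from
via_tally <- df %>%
  group_by(g) %>%
  tally()
```

Both produce one row per group with a count column named n.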
add_tally() = mutate(n = n())
Challenge
What does n in the below example represent?
gapminder %>%
dplyr::select(continent, country) %>%
add_tally()
add_count()
Add count as a column.
# Add count as a column
gapminder %>%
group_by(continent) %>%
add_count(year)
Challenge
Do cases 1 and 2 in the below code chunk produce the same outputs? If so, why?
# Case 1
gapminder %>%
group_by(continent, year) %>%
count()
# Case 2
gapminder %>%
group_by(continent) %>%
count(year)
count() is a simple function, but it is still helpful for learning an essential concept underlying complex data wrangling: the split-apply-combine strategy. For more information, read Wickham’s article (2011) “The Split-Apply-Combine Strategy for Data Analysis” published in the Journal of Statistical Software (especially pages 7-8). plyr was the (now retired) package that demonstrated this idea; it has since evolved in two directions: dplyr (for data frames) and purrr (for lists).
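The strategy predates the tidyverse; a minimal base R sketch on the built-in mtcars data shows all three steps (the names groups and group_means are illustrative):

```r
# Split: break a vector into groups by cylinder count
groups <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute a summary for each group
group_means <- vapply(groups, mean, numeric(1))

# Combine: vapply already combined the results into a named vector
group_means
```

dplyr's group_by() + summarise() expresses the same pattern, but keeps the result as a data frame and composes with other verbs.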
You can think of summarise() as an extended version of count().
gapminder %>%
group_by(continent) %>%
summarise(
n = n(),
mean_gdp = mean(gdpPercap),
sd_gdp = sd(gdpPercap)
)
tablea <- gapminder %>%
group_by(continent) %>%
summarise(
n = n(),
mean_gdp = mean(gdpPercap),
sd_gdp = sd(gdpPercap)
)
pacman::p_load(kableExtra,
flextable)
# For HTML and LaTeX
tablea %>% kableExtra::kable()
| continent | n | mean_gdp | sd_gdp |
|---|---|---|---|
| Africa | 624 | 2193.755 | 2827.930 |
| Americas | 300 | 7136.110 | 6396.764 |
| Asia | 396 | 7902.150 | 14045.373 |
| Europe | 360 | 14469.476 | 9355.213 |
| Oceania | 24 | 18621.609 | 6358.983 |
# For HTML and MS Office suite
tablea %>% flextable::flextable()
| continent | n | mean_gdp | sd_gdp |
|---|---|---|---|
| Africa | 624 | 2,193.755 | 2,827.930 |
| Americas | 300 | 7,136.110 | 6,396.764 |
| Asia | 396 | 7,902.150 | 14,045.373 |
| Europe | 360 | 14,469.476 | 9,355.213 |
| Oceania | 24 | 18,621.609 | 6,358.983 |
Old way
summarise_all()
# Create a wide-shaped data example
wide_gapminder <- gapminder %>%
filter(continent == "Europe") %>%
pivot_wider(
names_from = country,
values_from = gdpPercap
)
# Apply summarise_all
wide_gapminder %>%
dplyr::select(-c(1:4)) %>%
summarise_all(mean, na.rm = TRUE)
summarise_if(): using a logical condition
wide_gapminder %>%
summarise_if(is.double, mean, na.rm = TRUE)
summarise_at()
vars() works like select() inside the _at functions
wide_gapminder %>%
summarise_at(vars(-c(1:4)),
mean,
na.rm = TRUE
)
wide_gapminder %>%
summarise_at(vars(contains("life")),
mean,
na.rm = TRUE
)
Additional tips
Concept map for regular expressions. By Monica Alonso, Greg Wilson.
New way
summarise() + across()
Concept map for across. By Emma Vestesson
If you find using summarise_all(), summarise_if() and summarise_at() confusing, here’s a solution: use summarise() with across().
summarise_all()
wide_gapminder %>%
summarise(across(Albania:`United Kingdom`, mean, na.rm = TRUE))
wide_gapminder %>%
summarise(across(-c(1:4), mean, na.rm = TRUE))
summarise_if()
wide_gapminder %>%
summarise(across(is.double, mean, na.rm = TRUE))
## Warning: Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.double)
##
## # Good
## data %>% select(where(is.double))
##
## ℹ Please update your code.
## This message is displayed once per session.
summarise_at()
wide_gapminder %>%
summarise(across(-c(1:4),
mean,
na.rm = TRUE
))
wide_gapminder %>%
summarise(across(contains("life"),
mean,
na.rm = TRUE
))
wide_gapminder %>%
summarise(across(contains("A", ignore.case = FALSE)))
Note that this workshop does not cover creating and manipulating variables using mutate() because many techniques you learned from playing with summarise() can be directly applied to mutate().
Challenge
Summarize the average GDP of countries whose names start with the letter “A”.
Turn the summary dataframe into a publishable table using either kableExtra or flextable package.
Calculate the mean of gdpPercap.
Some functions are designed to work together. For instance, the group_by function defines the strata that you’re going to use for summary statistics. Then, use summarise() or summarize() for producing summary statistics.
gapminder %>%
group_by(continent) %>% #
summarise(mean_gdp = mean(gdpPercap))
gapminder %>%
group_by(continent) %>% #
summarise(
mean_gdp = mean(gdpPercap),
count = n()
)
Optional
median(x), sd(x), IQR(x), mad(x) (the median absolute deviation)
# The interquartile range = the difference between the 75th and 25th percentiles
gapminder %>%
group_by(continent) %>% #
summarise(IQR_gdp = IQR(gdpPercap))
min(x), quantile(x, 0.25), max(x)
gapminder %>%
group_by(continent) %>% #
summarise(
min_gdp = min(gdpPercap),
max_gdp = max(gdpPercap)
)
first(x), last(x), nth(x, 2)
gapminder %>%
group_by(continent) %>%
summarise(
first_gdp = first(gdpPercap),
last_gdp = last(gdpPercap)
)
gapminder %>%
group_by(continent) %>%
arrange(gdpPercap) %>% # Adding arrange
summarise(
first_gdp = first(gdpPercap),
last_gdp = last(gdpPercap)
)
n() (all rows), sum(!is.na(x)) (only non-missing rows), n_distinct(x) (the number of distinct values)
gapminder %>%
group_by(continent) %>%
summarise(ns = n())
sum(condition about x) (the number of TRUEs in x), mean(condition about x) (the proportion of TRUEs in x)
gapminder %>%
group_by(continent) %>%
summarise(rich_countries = mean(gdpPercap > 20000))
Additional tips
Also, check out window functions such as cumsum() and lag(). Window functions are a variant of aggregate functions that take a vector as input and return a vector of the same length as output.
vec <- c(1:10)
# Typical aggregate function
sum(vec) # The output length is one
## [1] 55
# Window function
cumsum(vec) # The output length is ten
## [1] 1 3 6 10 15 21 28 36 45 55
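lag() and running-summary functions follow the same vector-in, vector-out pattern; a short sketch (the variable change is an illustrative name):

```r
library(dplyr)

vec <- 1:10

# lag() shifts a vector by one position, padding with NA (or a default)
lag(vec)

# Period-over-period change: combine lag() with arithmetic
change <- vec - lag(vec, default = 0L)

# cummax() is another window function: a running maximum
cummax(vec)
```

Because the output has the same length as the input, these functions slot naturally into mutate(), whereas sum() and mean() belong in summarise().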
Relational data = multiple tables of data
Relational data example
Key ideas
# Example
planes$tailnum %>% head()
## [1] "N10156" "N102UW" "N103US" "N104UW" "N10575" "N105UW"
Verify primary key
tailnum should be unique.
Challenge
What do you expect the outcome to be?
planes %>%
count(tailnum) %>%
filter(n > 1)
Optional
If a dataframe doesn’t have a primary key, you can add one called a surrogate key.
# Toy example
df <- tibble(
x = c(1:3),
y = c(4:6)
)
# Add a row_index column
df <- df %>% rowid_to_column("ID")
flights$tailnum %>% head()
## [1] "N14228" "N24211" "N619AA" "N804JB" "N668DN" "N39463"
For joining, don’t be distracted by other details and focus on KEYS!
“Add new variables to one data frame from matching observations in another.” Using a simple toy example is great because it is easy to see how things work in such a narrow context.
# Table 1
x <- tibble(
key = c(1:4),
val_x = c("x1", "x2", "x3", "x4")
)
# Table 2
y <- tibble(
key = c(1:5),
val_y = c("y1", "y2", "y3", "y4", "y5")
)
inner_join() keeps the matched values in both tables. If the left table is a subset of the right table, then left_join() is the same as inner_join().
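The subset claim can be checked directly on the x and y tables defined above: every key in x also appears in y, so left_join() has nothing to NA-pad and inner_join() has nothing to drop (the names lj and ij are illustrative):

```r
library(dplyr)

x <- tibble(key = 1:4, val_x = c("x1", "x2", "x3", "x4"))
y <- tibble(key = 1:5, val_y = c("y1", "y2", "y3", "y4", "y5"))

# x's keys are a subset of y's keys, so these two results coincide
lj <- left_join(x, y, by = "key")
ij <- inner_join(x, y, by = "key")
```

If x contained a key absent from y, the two results would differ: left_join() would keep that row with an NA in val_y, while inner_join() would drop it.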
Challenge
What are the shared keys going to be?
inner_join(x, y)
## Joining, by = "key"
Mutating joins
left_join(), right_join() and full_join() are outer join functions. Unlike inner_join(), outer join functions keep observations that appear in at least one of the tables.
left_join() keeps all observations in the left table and adds the matched observations from the right table.
left_join(x, y)
## Joining, by = "key"
right_join() does the opposite.
right_join(x, y)
## Joining, by = "key"
full_join() keeps the observations from both tables. If observations are unmatched in one of the two tables, NAs are recorded.
full_join(x, y)
## Joining, by = "key"
Filter observations from one data frame based on whether they match an observation in the other table. - Semi Join
In SQL, this type of query is also called a subquery.
# Create the list of the top 10 destinations
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
top_n(10)
## Selecting by n
# Filter
filtered <- flights %>%
filter(dest %in% top_dest$dest)
joined <- flights %>%
semi_join(top_dest)
## Joining, by = "dest"
head(filtered == joined)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## time_hour
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
anti_join() does the opposite. Exclude the rows that were matched between the two tables. A great technique to filter stopwords when you do computational text analysis.
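A minimal sketch of the stopword use case (the tibbles tokens and stopwords and their contents are illustrative, not a real stopword list):

```r
library(dplyr)

tokens    <- tibble(word = c("the", "model", "of", "data"))
stopwords <- tibble(word = c("the", "of", "a"))

# Keep only the rows of tokens with no match in stopwords
kept <- anti_join(tokens, stopwords, by = "word")
```

In a real text analysis you would typically anti_join() your tokenized corpus against a curated list such as tidytext's stop_words.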
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
The following example comes from R for Data Science by Garrett Grolemund and Hadley Wickham.
Hadley Wickham: Managing many models with R
Grouped data: each row = an observation
Nested data: each row = a group
Challenge
In the following example, why did we use country and continent for nesting variables?
nested <- gapminder %>%
group_by(country, continent) %>%
nest()
head(nested)
nested$data %>% pluck(1)
lm_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
# Apply lm_model to the nested data
nested <- nested %>%
mutate(models = map(data, lm_model)) # Add the list object as a new column
head(nested)
S3 is part of R’s object-oriented systems. If you need further information, check out this section in Hadley’s Advanced R.
glance() function from broom package inspects the quality of a statistical model.
Additional tips
broom::glance(model): for evaluating model quality and/or complexity
broom::tidy(model): for extracting each coefficient in the model (the estimates + their variability)
broom::augment(model, data): for getting extra values (residuals and influence statistics). A convenient tool if you want to plot fitted values and raw data together.
Broom: Converting Statistical Models to Tidy Data Frames by David Robinson
glanced <- nested %>%
mutate(glance = map(models, broom::glance))
# Pluck the first item on the list
glanced$glance %>% pluck(1)
# Pull p.value
glanced$glance %>% pluck(1) %>% pull(p.value)
## value
## 9.835213e-08
unnest() unpacks the list objects stored in the glanced column
glanced %>%
unnest(glance) %>%
arrange(r.squared)
glanced %>%
unnest(glance) %>%
ggplot(aes(continent, r.squared)) +
geom_jitter(width = 0.5)
nested <- gapminder %>%
group_by(continent) %>%
nest()
nested <- nested %>%
mutate(models = map(data, ~lm(lifeExp ~ year + country, data = .)))
tidied <- nested %>%
mutate(tidied = map(models, broom::tidy))
model_out <- tidied %>%
unnest(tidied) %>%
mutate(term = str_replace(term, "country", "")) %>%
select(continent, term, estimate, p.value) %>%
mutate(p_threshold = ifelse(p.value < 0.05, 1, 0))
model_out %>% filter(p_threshold == 1) %>% pull(term) %>% unique()
## [1] "(Intercept)" "year"
## [3] "Bahrain" "Bangladesh"
## [5] "Cambodia" "China"
## [7] "Hong Kong, China" "India"
## [9] "Indonesia" "Iran"
## [11] "Iraq" "Israel"
## [13] "Japan" "Jordan"
## [15] "Korea, Dem. Rep." "Korea, Rep."
## [17] "Kuwait" "Lebanon"
## [19] "Malaysia" "Mongolia"
## [21] "Myanmar" "Nepal"
## [23] "Oman" "Pakistan"
## [25] "Philippines" "Saudi Arabia"
## [27] "Singapore" "Sri Lanka"
## [29] "Syria" "Taiwan"
## [31] "Thailand" "Vietnam"
## [33] "West Bank and Gaza" "Yemen, Rep."
## [35] "Austria" "Belgium"
## [37] "Croatia" "Czech Republic"
## [39] "Denmark" "Finland"
## [41] "France" "Germany"
## [43] "Greece" "Iceland"
## [45] "Ireland" "Italy"
## [47] "Montenegro" "Netherlands"
## [49] "Norway" "Poland"
## [51] "Portugal" "Slovak Republic"
## [53] "Slovenia" "Spain"
## [55] "Sweden" "Switzerland"
## [57] "Turkey" "United Kingdom"
## [59] "Angola" "Benin"
## [61] "Botswana" "Burkina Faso"
## [63] "Burundi" "Cameroon"
## [65] "Central African Republic" "Chad"
## [67] "Comoros" "Congo, Dem. Rep."
## [69] "Congo, Rep." "Cote d'Ivoire"
## [71] "Djibouti" "Equatorial Guinea"
## [73] "Eritrea" "Ethiopia"
## [75] "Gabon" "Gambia"
## [77] "Ghana" "Guinea"
## [79] "Guinea-Bissau" "Kenya"
## [81] "Lesotho" "Liberia"
## [83] "Madagascar" "Malawi"
## [85] "Mali" "Mauritania"
## [87] "Mauritius" "Mozambique"
## [89] "Namibia" "Niger"
## [91] "Nigeria" "Reunion"
## [93] "Rwanda" "Senegal"
## [95] "Sierra Leone" "Somalia"
## [97] "South Africa" "Sudan"
## [99] "Swaziland" "Tanzania"
## [101] "Togo" "Uganda"
## [103] "Zambia" "Zimbabwe"
## [105] "Bolivia" "Brazil"
## [107] "Canada" "Colombia"
## [109] "Dominican Republic" "Ecuador"
## [111] "El Salvador" "Guatemala"
## [113] "Haiti" "Honduras"
## [115] "Mexico" "Nicaragua"
## [117] "Paraguay" "Peru"
## [119] "Puerto Rico" "Trinidad and Tobago"
## [121] "United States" "Venezuela"
## [123] "New Zealand"
model_out %>% filter(p_threshold == 0) %>% pull(term) %>% unique()
## [1] "Bosnia and Herzegovina" "Bulgaria" "Hungary"
## [4] "Romania" "Serbia" "Egypt"
## [7] "Libya" "Morocco" "Sao Tome and Principe"
## [10] "Tunisia" "Chile" "Costa Rica"
## [13] "Cuba" "Jamaica" "Panama"
## [16] "Uruguay"
We have tasted a little of how the map() function works. Let’s dig into it more deeply, as this family of functions is useful. For more information, see Rebecca Barter’s excellent tutorial on the purrr package. In her words, purrr is “the tidyverse’s answer to apply functions for iteration”. The map() function can take a vector (of any type), a list, or a dataframe as input.
multiply <- function(x) {
x * x
}
df <- list(
first_obs = rnorm(7, 1, sd = 1),
second_obs = rnorm(7, 2, sd = 2)
) # normal distribution
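A minimal sketch of map() applied to this kind of list (the name squared is illustrative; note that map() always returns a list, while the typed variants such as map_dbl() return atomic vectors):

```r
library(purrr)

multiply <- function(x) {
  x * x
}

# map() applies the function to each list element and returns a list
squared <- map(list(first_obs = 1:3, second_obs = 4:6), multiply)

# Type-stable variants return atomic vectors instead of lists
map_dbl(c(1, 2, 3), multiply)
```

Keep this list-versus-vector distinction in mind for the challenge below.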
Challenge
Try map_df(.x = df, .f = multiply) and tell me the difference between the output you get and what you saw earlier.
If you want to know more about the power and joy of functional programming in R (e.g., purrr::map()), then please take “How to Automate Repeated Things in R” workshop.
The following material is adapted from Kieran Healy’s excellent book (2019) on data visualization and Hadley Wickham’s equally excellent book on ggplot2. For more theoretical discussions, I recommend you to read The Grammar of Graphics by Leland Wilkinson.
Why should we care about data visualization? More precisely, why should we learn the grammar of statistical graphics?
Sometimes, pictures are better tools than words in 1) exploring, 2) understanding, and 3) explaining data.
Anscombe’s quartet comprises four datasets that are nearly identical in terms of their descriptive statistics but look quite different when presented graphically.
# Set theme
theme_set(theme_minimal())
# Data
anscombe
# Correlation
cor(anscombe)[c(1:4), c(5:8)]
## y1 y2 y3 y4
## x1 0.8164205 0.8162365 0.8162867 -0.3140467
## x2 0.8164205 0.8162365 0.8162867 -0.3140467
## x3 0.8164205 0.8162365 0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610 0.8165214
# gather and select
anscombe_processed <- anscombe %>%
gather(x_name, x_value, x1:x4) %>%
gather(y_name, y_value, y1:y4)
# plot
anscombe_processed %>%
ggplot(aes(x = x_value, y = y_value)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
facet_grid(x_name ~ y_name) +
theme_bw() +
labs(
x = "X values",
y = "Y values",
title = "Anscombe's quartet"
)
## `geom_smooth()` using formula 'y ~ x'
the grammar of graphics
No worries about new terms. We’re going to learn them by actually plotting.
Workflow:
aes (aesthetic mappings or aesthetics) tells which variables (x, y) in your data should be represented by which visual elements (color, shape, size) in the plot.
geom_ tells the type of plot you are going to use
p <- ggplot(
data = gapminder,
mapping = aes(x = gdpPercap, y = lifeExp)
) # ggplot or R in general takes positional arguments too. So, you don't need to name data, mapping each time you use ggplot2.
p
p + geom_point()
p + geom_point() + geom_smooth() # geom_smooth has calculated a smoothed line;
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# the shaded area is the standard error for the line
geom_histogram(): For the probability distribution of a continuous variable. Bins divide the entire range of values into a series of intervals (see the Wiki entry).
geom_density(): Also for the probability distribution of a continuous variable. It calculates a kernel density estimate of the underlying distribution.
data(midwest) # load midwest dataset
midwest
midwest %>%
ggplot(aes(x = area)) +
geom_point() # not working; geom_point() needs both x and y aesthetics
midwest %>%
ggplot(aes(x = area)) +
geom_histogram() # stat_bin argument picks up 30 bins (or "bucket") by default.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
midwest %>%
ggplot(aes(x = area)) +
geom_histogram(bins = 10) # only 10 bins.
ggplot(
data = subset(midwest, state %in% c("OH", "IN")),
mapping = aes(x = percollege, fill = state)
) +
geom_histogram(alpha = 0.7, bins = 20) +
scale_fill_viridis_d()
midwest %>%
ggplot(aes(x = area, fill = state, color = state)) +
geom_density(alpha = 0.3) +
scale_color_viridis_d() +
scale_fill_viridis_d()
There’s also the fill argument (mostly used in geom_bar()). The color aesthetic affects the appearance of lines and points; fill is for the filled areas of bars, polygons, and, in some cases, the interior of a smoother’s standard error ribbon.
The property size/color/fill represents…
ggplot(
data = gapminder,
mapping = aes(
x = gdpPercap, y = lifeExp,
size = pop
)
) +
geom_point()
ggplot(
data = gapminder,
mapping = aes(
x = gdpPercap, y = lifeExp,
size = pop,
color = continent
)
) +
geom_point() +
scale_color_viridis_d()
# try red instead of "red". Note: mapping color = "red" inside aes() does not make the points red;
# it creates a categorical variable whose only value is labeled "red". To actually color the points,
# set color = "red" inside geom_point() instead.
ggplot(
data = gapminder,
mapping = aes(
x = gdpPercap, y = lifeExp,
size = pop,
color = "red"
)
) +
geom_point()
Aesthetics can also be mapped per geom.
p + geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
p + geom_point(alpha = 0.3) + # alpha controls transparency
geom_smooth(color = "red", se = FALSE, size = 2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
p + geom_point(alpha = 0.3) + # alpha controls transparency
geom_smooth(color = "red", se = FALSE, size = 2, method = "lm")
## `geom_smooth()` using formula 'y ~ x'
ggplot(
data = gapminder,
mapping = aes(
x = gdpPercap, y = lifeExp,
color = continent
)
) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", color = "red") +
labs(
x = "log GDP",
y = "Life Expectancy",
title = "A Gapminder Plot",
subtitle = "Data points are country-years",
caption = "Source: Gapminder"
)
## `geom_smooth()` using formula 'y ~ x'
ggplot(
data = gapminder,
mapping = aes(
x = gdpPercap, y = lifeExp,
color = continent,
fill = continent
)
) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", color = "red") +
labs(
x = "log GDP",
y = "Life Expectancy",
title = "A Gapminder Plot",
subtitle = "Data points are country-years",
caption = "Source: Gapminder"
) +
scale_color_viridis_d() +
scale_fill_viridis_d()
## `geom_smooth()` using formula 'y ~ x'
p + geom_point() +
coord_flip() # flips the x and y axes; one of several coord_* functions
The data is heavily bunched up against the left side.
p + geom_point() # without scaling
p + geom_point() +
scale_x_log10() # rescales the x-axis to a log10 scale
p + geom_point() +
geom_smooth(method = "lm") +
scale_x_log10()
## `geom_smooth()` using formula 'y ~ x'
The scales package has some useful premade formatting functions. You can either load scales or just grab the function you need with the scales:: prefix.
p + geom_point(alpha = 0.3) +
geom_smooth(method = "loess", color = "red") +
scale_x_log10(labels = scales::dollar) +
labs(
x = "log GDP",
y = "Life Expectancy",
title = "A Gapminder Plot",
subtitle = "Data points are country-years",
caption = "Source: Gapminder"
)
## `geom_smooth()` using formula 'y ~ x'
p + geom_point(alpha = 0.3) +
geom_smooth(method = "loess", color = "red") +
scale_x_log10(labels = scales::dollar) +
labs(
x = "log GDP",
y = "Life Expectancy",
title = "A Gapminder Plot",
subtitle = "Data points are country-years",
caption = "Source: Gapminder"
) +
theme_economist()
## `geom_smooth()` using formula 'y ~ x'
figure_example <- p + geom_point(alpha = 0.3) +
geom_smooth(method = "gam", color = "red") +
scale_x_log10(labels = scales::dollar) +
labs(
x = "log GDP",
y = "Life Expectancy",
title = "A Gapminder Plot",
subtitle = "Data points are country-years",
caption = "Source: Gapminder"
) +
theme_economist()
ggsave(here("outputs", "figure_example.png"), figure_example) # ggsave() takes the filename first, then the plot
Basic ideas:
Tell ggplot2 about the structure of your data.
p <- ggplot(gapminder, aes(x = year, y = gdpPercap))
p + geom_point()
p + geom_line()
geom_line joins up the observations in the order they appear in the dataset. ggplot2 does not know that the yearly observations in your data are grouped by country.
Note that you need an explicit group aesthetic when the grouping structure is not already implied by a mapped variable (like continent).
gapminder
Faceting makes small multiples.
facet_wrap: based on a single categorical variable, as in facet_wrap(~single_categorical_variable). Your panels will be laid out in order and then wrapped into a grid.
facet_grid: when you want to cross-classify the data by two categorical variables, as in facet_grid(first_cat_variable ~ second_cat_variable).
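A minimal two-variable sketch of facet_grid(), using ggplot2's bundled mpg data (the variable choices are illustrative):

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl) # rows: drive type; columns: number of cylinders
```

The variable before the ~ defines the rows of the panel grid, and the one after defines the columns.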
p <- ggplot(gapminder, aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) # group by country; the outlier is Kuwait.
p + geom_line(aes(group = country)) + facet_wrap(~continent) # facetting
p + geom_line(aes(group = country), color = "gray70") +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels = scales::dollar) +
facet_wrap(~continent, ncol = 5) + # for single categorical variable; for multiple categorical variables use facet_grid()
labs(
x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five continents"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'
p + geom_line(aes(group = country), color = "gray70") +
geom_smooth(size = 1.1, method = "loess", se = FALSE) +
scale_y_log10(labels = scales::dollar) +
facet_grid(~continent) + # facet_grid() with a single variable lays all panels out in one row
labs(
x = "Year",
y = "GDP per capita",
title = "GDP per capita on Five continents"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula 'y ~ x'
Also, we experiment with bar charts here. By default, geom_bar() uses stat = "count", which makes the height of each bar equal to the number of cases in each group. If you already have a y column, then you should use the stat = "identity" argument. Alternatively, you can use geom_col().
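A minimal sketch of the equivalence (the toy data frame is illustrative, not from the original):

```r
df <- data.frame(group = c("a", "b"), n = c(3, 5))

# These two calls produce the same bar chart:
ggplot(df, aes(x = group, y = n)) + geom_bar(stat = "identity")
ggplot(df, aes(x = group, y = n)) + geom_col()
```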
gapminder_formatted <- gapminder %>%
group_by(continent, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
ggplot(data = gapminder_formatted, aes(x = year, y = lifeExp_mean, color = continent)) +
geom_point() +
labs(
x = "Year",
y = "Life expectancy",
title = "Life expectancy on Five continents"
)
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = year, y = lifeExp_mean, color = country)) +
geom_point() +
labs(
x = "Year",
y = "Life expectancy",
title = "Life expectancy in Europe"
)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
# geom point
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = year, y = lifeExp_mean)) +
geom_point() +
labs(
x = "Year",
y = "Life expectancy",
title = "Life expectancy in Europe"
) +
facet_wrap(~country)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
# geom bar
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = year, y = lifeExp_mean)) +
geom_bar(stat = "identity") +
labs(
x = "Year",
y = "Life expectancy",
title = "Life expectancy in Europe"
) +
facet_wrap(~country)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
# no facet
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = year, y = lifeExp_mean, fill = country)) +
geom_bar(stat = "identity") + # even if you not stack, still the plot looks messy or you can use geom_col()
labs(
x = "Year",
y = "Life expectancy",
title = "Life expectancy in Europe"
)
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = country, y = lifeExp_mean)) +
geom_boxplot() +
labs(
x = "Country",
y = "Life expectancy",
title = "Life expectancy in Europe"
) +
coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
# reorder (ascending)
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = reorder(country, lifeExp_mean), y = lifeExp_mean)) +
geom_boxplot() +
labs(
x = "Country",
y = "Life expectancy",
title = "Life expectancy in Europe"
) +
coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
# reorder (descending)
gapminder %>%
filter(continent == "Europe") %>%
group_by(country, year) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = reorder(country, -lifeExp_mean), y = lifeExp_mean)) +
geom_boxplot() +
labs(
x = "Country",
y = "Life expectancy",
title = "Life expectancy in Europe"
) +
coord_flip()
## `summarise()` has grouped output by 'country'. You can override using the `.groups` argument.
gapminder %>%
filter(continent == "Asia" | continent == "Americas") %>%
group_by(continent, country) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
geom_point() +
geom_text(aes(label = country)) +
scale_x_log10() +
facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
# with label
gapminder %>%
filter(continent == "Asia" | continent == "Americas") %>%
group_by(continent, country) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
geom_point() +
geom_label(aes(label = country)) +
scale_x_log10() +
facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
# no overlaps
gapminder %>%
filter(continent == "Asia" | continent == "Americas") %>%
group_by(continent, country) %>%
summarize(
gdp_mean = mean(gdpPercap),
lifeExp_mean = mean(lifeExp)
) %>%
ggplot(aes(x = gdp_mean, y = lifeExp_mean)) +
geom_point() +
geom_text_repel(aes(label = country)) + # there's also geom_label_repel
scale_x_log10() +
facet_grid(~continent)
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
## Warning: ggrepel: 2 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
In plotting models, we extensively use David Robinson’s broom package in R. The idea is to transform model outputs (i.e., predictions and estimations) into tidy objects so that we can easily combine, separate, and visualize these elements.
model_colors <- RColorBrewer::brewer.pal(3, "Set1") # select three qualitatively different colors from a larger palette.
gapminder %>%
ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
geom_smooth(
method = "lm", formula = y ~ splines::bs(x, df = 3),
aes(color = "Cubic Spline", fill = "Cubic Spline")
) +
geom_smooth(method = "loess", aes(color = "LOESS", fill = "LOESS")) +
theme(legend.position = "top") +
scale_color_manual(name = "Models", values = model_colors) +
scale_fill_manual(name = "Models", values = model_colors)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# regression model
out <- lm(
formula = lifeExp ~ gdpPercap + pop + continent,
data = gapminder
)
tidy() is a method in the broom package. It “constructs a data frame that summarizes the model’s statistical findings.” As the description states, tidy() can be used with various kinds of models. For instance, tidy() can extract the following information from a regression model:
term: a term being estimated
p.value: the p-value for the term
statistic: a test statistic used to compute the p-value
estimate: the estimated value of the term
conf.low: the low end of a confidence interval
conf.high: the high end of a confidence interval
df: degrees of freedom
Challenge
Try glance(out), what did you get from these commands? If you’re curious, you can try ?glance.
The following plots show the degree of confidence (uncertainty) in the estimates.
# estimates
out_comp <- tidy(out)
p <- out_comp %>%
ggplot(aes(x = term, y = estimate))
p + geom_point() +
coord_flip() +
theme_bw()
# plus confidence intervals
out_conf <- tidy(out, conf.int = TRUE)
# plotting coefficients using ggplot2 (pointrange)
out_conf %>%
ggplot(aes(x = reorder(term, estimate), y = estimate, ymin = conf.low, ymax = conf.high)) +
geom_pointrange() +
coord_flip() +
labs(x = "", y = "OLS Estimate") +
theme_bw()
# another way to do it (errorbar)
out_conf %>%
ggplot(aes(x = estimate, y = reorder(term, estimate))) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
labs(y = "", x = "OLS Estimate") +
theme_bw()
You can calculate marginal effects using the margins package. For the sake of time, I’m not covering that here.
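A minimal sketch of what that would look like, assuming the margins package is installed (it is not loaded above):

```r
# install.packages("margins") # if needed
library(margins)

m <- margins(out) # average marginal effects for the regression fit above
summary(m)        # AME, standard error, and confidence interval per term
```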